🏛️ M00: Core Prompting & Context Engineering

This foundational module covers the physical, economic, and computational constraints of prompting LLMs in production systems. You will learn to optimize context windows, configure model parameters, and prevent runtime failures.

🏛️ 1. Architectural Deep Dive: Attention & KV Caching

When designing production agent loops, treat a prompt as more than a string. It is tokenized and embedded into an input matrix that the transformer's attention layers process, so its length and layout carry direct compute and memory costs.

KV Cache & Context Attention

During inference, the model keeps the key and value (KV) tensors of previously processed tokens in memory (the KV cache), so each generation step only computes them for the newest token instead of recomputing them for the whole context.

VRAM Overhead: The size of the KV cache scales linearly with context length and with the number of concurrent requests. Large contexts put pressure on GPU VRAM, which limits batch size and can increase latency, including TTFT (Time To First Token).
Attention Drift: As context length grows, attention is spread across more tokens, and models tend to follow instructions or constraints placed in the middle of the prompt less reliably than those near the beginning or end (the "Lost in the Middle" phenomenon).
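
The linear scaling described above can be checked with a back-of-the-envelope estimate. The following is a minimal sketch, assuming standard multi-head attention with fp16/bf16 storage; the function name and the model shape are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16/bf16 storage (an assumption, not a given).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(f"{size / 2**30:.2f} GiB")  # prints "2.00 GiB"
```

Doubling either `seq_len` or `batch_size` doubles the result, which is why long contexts and high concurrency compete for the same VRAM.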

Prompt Caching Internals

To avoid recomputing the KV states of a repeated prompt prefix on every request, providers such as Google Gemini and Anthropic Claude cache identical leading context at the API level.

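Cache Hits: Caching only applies to prefixes above a minimum size (e.g. 1,024 tokens for Claude, 32,768 tokens for Gemini). These thresholds vary by model and have changed over time, so check the provider's current documentation.
Byte-for-Byte Match: The cached prefix (system prompts, database schemas, code libraries) must be exactly identical. A single character change (including whitespace) invalidates the cache.
Token Economics: Cached input tokens are billed at a significantly reduced rate (up to 90% cheaper than standard input tokens). Depending on the provider, writing to or storing the cache may carry an extra charge, so caching pays off only when the prefix is actually reused.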
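
The exact-match rule and the discount can both be sanity-checked locally before any API call. The sketch below is illustrative only: `prefix_fingerprint` and `input_cost` are hypothetical helpers, the price is a placeholder, and the cost model ignores any cache write or storage fees a provider may charge.

```python
import hashlib


def prefix_fingerprint(static_prefix: str) -> str:
    """Hash the exact bytes of the cacheable prefix; any change yields a new hash."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()


def input_cost(prefix_tokens, suffix_tokens, price_per_token, cached_discount=0.9, cache_hit=True):
    """Estimate input cost for one request.

    cached_discount=0.9 models the "up to 90% cheaper" case from the text.
    Cache write and storage fees are deliberately not modeled.
    """
    prefix_rate = price_per_token * (1 - cached_discount) if cache_hit else price_per_token
    return prefix_tokens * prefix_rate + suffix_tokens * price_per_token


schema = "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);"
unchanged = prefix_fingerprint(schema) == prefix_fingerprint(schema)
with_extra_space = prefix_fingerprint(schema) == prefix_fingerprint(schema + " ")
print(unchanged, with_extra_space)  # prints "True False"
```

Logging the fingerprint alongside each request is one way to spot an accidental prefix change (for example a timestamp in the system prompt) before it shows up on the bill.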

📊 2. Tradeoff Matrix: Context Optimization Methods

| Method | Latency (TTFT) | VRAM Footprint | Token Cost | Output Consistency | Primary Production Bottleneck |
| --- | --- | --- | --- | --- | --- |
| Zero-Shot Prompting | Ultra-Low (< 200ms) | Negligible | Very Low | Brittle / Low | Hallucinations on structured formats |
| Few-Shot XML Prompting | Moderate (~500ms) | Low | Low | Very High | Token inflation from repetitive examples |
| Context Caching | Low (after 1st run) | High (GCP managed) | Ultra-Low (90% off) | High | Cache eviction cycles on long idle states |
| Context Pruning | Moderate | Low | Low | High | Information loss from aggressive compression |
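
To make the Context Pruning row concrete, here is a minimal sketch of one common pruning strategy: keep the system prompt and drop the oldest messages until the rest fits a token budget. The helper names are hypothetical, and the whitespace-based token count is a crude stand-in for a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (whitespace-separated words); a real tokenizer will differ."""
    return len(text.split())


def prune_history(system_prompt, messages, budget):
    """Keep the system prompt plus the most recent messages that fit the budget."""
    remaining = budget - approx_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = approx_tokens(message)
        if cost > remaining:
            break  # everything older is dropped as well
        kept.append(message)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))


history = ["a b c d", "e f", "g h i"]
print(prune_history("You are helpful", history, budget=9))
# prints "['You are helpful', 'e f', 'g h i']"
```

The dropped message is exactly the "information loss" the table warns about, so aggressive budgets should be paired with summaries or retrieval of the discarded turns.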

🛠️ 3. Step-by-Step Mechanics: Structured Prompts & Tuning

To get predictable, machine-parseable output, we combine a structured prompt format based on XML tags with low-temperature parameter tuning.

1. XML Structured Markup

XML tags act as clear boundaries, preventing the LLM from confusing system instructions with user-submitted data payloads:

```xml
<system_instructions>
You are an expert PostgreSQL database administrator.
Your goal is to output SQL statements based on user schemas.
</system_instructions>

<constraints>
- Output raw SQL only.
- Do NOT include explanation blocks.
</constraints>

<schema_context>
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);
</schema_context>

<user_query>
Add an email column to the users table.
</user_query>
```
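
The template above can be assembled in code so that the static parts stay byte-identical between requests and user input cannot close a tag early. This is a minimal sketch, assuming the schema is trusted and only the user query is untrusted; `build_prompt` and `STATIC_PREFIX` are hypothetical names.

```python
from xml.sax.saxutils import escape

# Static, cache-friendly part of the prompt: never interpolate runtime values here.
STATIC_PREFIX = """<system_instructions>
You are an expert PostgreSQL database administrator.
Your goal is to output SQL statements based on user schemas.
</system_instructions>

<constraints>
- Output raw SQL only.
- Do NOT include explanation blocks.
</constraints>
"""


def build_prompt(schema: str, user_query: str) -> str:
    """Assemble the prompt; the untrusted query is escaped so it cannot close a tag."""
    return (
        STATIC_PREFIX
        + f"\n<schema_context>\n{schema}\n</schema_context>\n"  # schema assumed trusted
        + f"\n<user_query>\n{escape(user_query)}\n</user_query>\n"  # dynamic input goes last
    )


prompt = build_prompt(
    "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);",
    "Add an email column to the users table.",
)
print(prompt)
```

Escaping turns `<`, `>`, and `&` in the query into entities, so a payload such as `</user_query>` arrives as inert text rather than as a boundary.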

2. Parameter Tuning Configurations

Set these parameters in your API config payloads:

temperature = 0.0: Makes decoding effectively greedy: the model selects the highest-probability token at each step. This is strongly recommended for coding and JSON serialization tasks, where it reduces formatting failures. Outputs can still vary slightly between runs on some providers, so keep validating them.
max_output_tokens: Must be set with a safety buffer (e.g. 2048 or 4096). If set too low, the output is truncated mid-structure and JSON parsing fails.
top_p & top_k: Leave at their defaults when temperature = 0.0, since greedy decoding leaves them little to act on. If tuning a reasoning agent with a non-zero temperature, keep top_p at 0.95 to allow minor variation while pruning low-probability nonsense tokens.
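
As an illustration of how these settings interact with parsing, the sketch below pairs a config payload with a parser that distinguishes truncation from malformed JSON. The key names mirror the parameters above, but real SDKs may name or nest them differently; `hit_token_limit` is a hypothetical flag the caller would derive from the provider's finish or stop reason.

```python
import json

# Hypothetical payload shape; top_p and top_k are left at provider defaults.
GENERATION_CONFIG = {
    "temperature": 0.0,
    "max_output_tokens": 2048,
}


def parse_json_output(text: str, hit_token_limit: bool):
    """Return (data, error), reporting truncation separately from invalid JSON."""
    if hit_token_limit:
        return None, "truncated: raise max_output_tokens or shorten the task"
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"


print(parse_json_output('{"column": "email"}', hit_token_limit=False))
# prints "({'column': 'email'}, None)"
```

Separating the two failure cases matters because they need different fixes: a truncated output calls for a larger token budget, while invalid JSON calls for a retry or a stricter prompt.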

🛡️ 4. Failure Mode Analysis: Mitigating Prompt Failures

| Failure Mode | Log Signature / Error | Root Cause | Code Mitigation |
| --- | --- | --- | --- |
| JSON Parse Crash | `json.decoder.JSONDecodeError` | Output truncated due to low `max_output_tokens`. | Increase `max_output_tokens` or implement Pydantic validation retries. |
| Attention Loss | Agent ignores negative constraints. | Context window overflow or instruction placed in the middle. | Wrap target contents in XML tags; place core constraints at the end of the prompt. |
| Cache Invalidation | Increased token billing on consecutive runs. | Prompt prefix changed (dynamic timestamps, variables, or whitespace). | Place all dynamic inputs (user query, runtime variables) at the absolute bottom of the payload. |
| Rate Limiting | `ResourceExhausted (429)` | Exceeded provider TPM/RPM constraints. | Implement exponential backoff retry loops using `tenacity`. |
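
The rate-limiting mitigation in the table can be sketched without third-party dependencies. The version below uses only the standard library to stay self-contained; it is not the `tenacity` API, and `RateLimitError` is a hypothetical stand-in for whatever 429 exception your SDK raises.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's 429 / ResourceExhausted error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the retry schedule can be unit-tested without real waiting; in production the default `time.sleep` applies.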

🧪 5. Runtime Verification: What to Observe

To verify your prompt design and cache behavior:

Test 1: Cold vs. Hot Run Verification

Launch the Gemini CLI or execute a Python script loading a large (35k token) codebase context.
Observe TTFT:
- First Run (Cold): Latency will be higher (for example ~5-8s) as the provider processes the full context and caches its KV states.
- Second Run (Hot): Latency should drop noticeably (for example to <1.5s), indicating a likely cache hit.
Audit the usage metadata in the API response or request log. Confirm that the reported cached input token count roughly matches your codebase context size.
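
One way to capture the cold and hot numbers is a small timing harness around a streaming call. This is a sketch under assumptions: `measure_ttft` is a hypothetical helper, and `fake_stream` only simulates a response so the example runs offline; replace it with a wrapper around your SDK's streaming request.

```python
import time


def measure_ttft(stream_fn):
    """Return seconds until the first chunk arrives from a streaming call.

    stream_fn is a zero-argument callable returning an iterator of chunks.
    """
    start = time.perf_counter()
    for _chunk in stream_fn():
        return time.perf_counter() - start
    return None  # the stream produced no chunks


def fake_stream():
    # Simulated response for demonstration only; not a real API call.
    time.sleep(0.05)
    yield "SELECT"
    yield " 1;"


cold = measure_ttft(fake_stream)
hot = measure_ttft(fake_stream)
print(f"cold={cold:.3f}s hot={hot:.3f}s")
```

With a real endpoint, run the same large-context request twice in quick succession and compare the two values; latency alone is suggestive, so pair it with the cached token count from the usage metadata.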

Test 2: XML Parameter Tuning

Run a script that requests structured JSON using a temperature of 0.8.
Perform 50 consecutive runs. Count the number of runs where the output fails to parse (e.g., trailing commas, missing brackets).
Change temperature to 0.0 and repeat. Confirm that the formatting error rate drops sharply, ideally to 0%.
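
The counting step can be automated with a small harness. The following is a minimal sketch: `parse_failure_rate` is a hypothetical helper, and the canned outputs only simulate a model that emits a trailing comma half the time; in a real test, `generate` would wrap your API call at the chosen temperature.

```python
import itertools
import json


def parse_failure_rate(generate, runs=50):
    """Call generate() `runs` times and return the fraction of outputs that fail json.loads."""
    failures = 0
    for _ in range(runs):
        try:
            json.loads(generate())
        except json.JSONDecodeError:
            failures += 1
    return failures / runs


# Simulated outputs: every second response has a trailing comma.
canned = itertools.cycle(['{"ok": true}', '{"ok": true,}'])
print(parse_failure_rate(lambda: next(canned), runs=50))  # prints "0.5"
```

Run it once per temperature setting and compare the two rates rather than eyeballing individual outputs.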
